Missing or Inapplicable: Treatment of Incomplete Continuous-valued Features in Supervised Learning

Authors

  • Lei Liu
  • Prakash Mandayam Comar
  • Antonio Nucci
  • Sabyasachi Saha
  • Pang-Ning Tan
Abstract

Real-world data are often riddled with data quality problems such as noise, outliers, and missing values, which make it difficult for supervised learning algorithms to classify them effectively. This paper explores the ill effects of inapplicable features on the performance of supervised learning algorithms. In particular, we highlight the difference between missing and inapplicable feature values. We argue that current approaches for dealing with missing values, which are mostly based on single or multiple imputation, are insufficient for handling inapplicable features, especially continuous-valued ones. We also illustrate how current tree-based and kernel-based classifiers can be adversely affected by such features if they are not handled appropriately. Finally, we propose methods to extend existing tree-based and kernel-based classifiers to deal with inapplicable continuous-valued features.
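The distinction the abstract draws between missing and inapplicable values can be illustrated with a small sketch. This is not the method proposed in the paper, whose details are not given here. It only shows one common way to keep the two cases apart: impute a missing value from the observed ones, and flag an inapplicable value with a separate indicator instead of imputing it. The `Gap` enum, the `encode` helper, the placeholder `0.0`, and the `weeks` example column are all hypothetical names chosen for illustration.

```python
from enum import Enum
from statistics import mean


class Gap(Enum):
    """Hypothetical markers for the two kinds of incomplete values."""
    MISSING = "missing"            # a value exists but was not recorded
    INAPPLICABLE = "inapplicable"  # no value exists for this instance


def encode(column):
    """Return (values, applicable) lists for one continuous feature.

    MISSING entries are mean-imputed from the observed values.
    INAPPLICABLE entries get a placeholder 0.0 and applicable=0, so a
    downstream model can tell them apart from real measurements.
    """
    observed = [v for v in column if not isinstance(v, Gap)]
    fill = mean(observed)
    values, applicable = [], []
    for v in column:
        if v is Gap.INAPPLICABLE:
            values.append(0.0)       # placeholder, assumed for illustration
            applicable.append(0)
        elif v is Gap.MISSING:
            values.append(fill)      # single (mean) imputation
            applicable.append(1)
        else:
            values.append(float(v))
            applicable.append(1)
    return values, applicable


# Example: a feature that is undefined for the second instance and
# simply unrecorded for the third.
weeks = [12.0, Gap.INAPPLICABLE, Gap.MISSING, 30.0]
values, applicable = encode(weeks)
print(values)      # [12.0, 0.0, 21.0, 30.0]
print(applicable)  # [1, 0, 1, 1]
```

Imputing the inapplicable entry with the mean (21.0) would assert a measurement that cannot exist for that instance, which is the kind of problem the abstract attributes to imputation-based approaches.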


Similar resources

Hash-Based Feature Learning for Incomplete Continuous-Valued Data

Hash-based feature learning is a widely used data mining approach for dimensionality reduction and for building linear models that are comparable in performance to their nonlinear counterparts. Unfortunately, such an approach is inapplicable to many real-world data sets because they are often riddled with missing values. Substantial data preprocessing is therefore needed to impute the missing va...


A New Algorithm for Optimization of Fuzzy Decision Tree in Data Mining

Decision-tree algorithms provide one of the most popular methodologies for symbolic knowledge acquisition. The resulting knowledge, a symbolic decision tree along with a simple inference mechanism, has been praised for comprehensibility. The most comprehensible decision trees have been designed for perfect symbolic data. Classical crisp decision trees (DT) are widely applied to classification t...


Semi-supervised Learning for Mixed-Type Data via Formal Concept Analysis

We propose a semi-supervised learning (SSL) method, called SELF (SEmi-supervised Learning via FCA), using Formal Concept Analysis (FCA). It can handle mixed-type data containing both discrete and continuous variables; numerical data are discretized by binary encoding...


Bayesian Image Segmentation Using Gaussian Field Priors

The goal of segmentation is to partition an image into a finite set of regions, homogeneous in some (e.g., statistical) sense, thus being an intrinsically discrete problem. Bayesian approaches to segmentation use priors to impose spatial coherence; the discrete nature of segmentation demands priors defined on discrete-valued fields, thus leading to difficult combinatorial problems. This paper p...


Efficient Methods for Dealing with Missing Data in Supervised Learning

We present efficient algorithms for dealing with the problem of missing inputs (incomplete feature vectors) during training and recall. Our approach is based on the approximation of the input data distribution using Parzen windows. For recall, we obtain closed form solutions for arbitrary feedforward networks. For training, we show how the backpropagation step for an incomplete pattern can be app...




Publication date: 2013